Abstract

Probabilistic forecasts of a continuous variable take the form of predictive densities or predictive cumulative distribution functions. We propose a diagnostic approach to the evaluation of predictive performance that is based on the paradigm of maximizing the sharpness of the predictive distributions subject to calibration. Calibration refers to the statistical consistency between the distributional forecasts and the observations and is a joint property of the predictions and the events that materialize. Sharpness refers to the concentration of the predictive distributions and is a property of the forecasts only. A simple game-theoretic framework allows us to distinguish probabilistic calibration, exceedance calibration and marginal calibration. We propose and study tools for checking calibration and sharpness, among them the probability integral transform (PIT) histogram, marginal calibration plots, the sharpness diagram and proper scoring rules. The diagnostic approach is illustrated by an assessment and ranking of probabilistic forecasts of wind speed at the Stateline wind energy center in the US Pacific Northwest. In combination with cross-validation or in the time series context, our proposal provides very general, nonparametric alternatives to the use of information criteria for model diagnostics and model selection.

Keywords: Cross-validation; Density forecast; Ensemble prediction system; Forecast verification; Model diagnostics; Posterior predictive assessment; Predictive distribution; Prequential principle; Probability integral transform; Proper scoring rule


1  Introduction 

A  major  human  desire  is  to  make  forecasts  for  the  future.  Forecasts  characterize  and  reduce  but 
generally  do  not  eliminate  uncertainty.  Consequently,  forecasts  should  be  probabilistic  in  nature, 
taking  the  form  of  probability  distributions  over  future  events  (Dawid  1984).  Indeed,  over  the  past 
two  decades  the  quest  for  good  probabilistic  forecasts  has  become  a  driving  force  in  meteorology. 
Major  economic  forecasts  such  as  the  quarterly  Bank  of  England  inflation  report  are  issued  in  terms 
of  predictive  distributions,  and  the  rapidly  growing  area  of  financial  risk  management  is  dedicated 
to  probabilistic  forecasts  of  portfolio  values  (Duffie  and  Pan  1997).  In  the  statistical  literature,
advances  in  Markov  chain  Monte  Carlo  methodology  (see,  for  example,  Besag,  Green,  Higdon  and 
Mengersen  1995)  have  led  to  explosive  growth  in  the  use  of  predictive  distributions,  mostly  in  the 
form  of  Monte  Carlo  samples  from  the  posterior  predictive  distribution  of  quantities  of  interest. 

It is often crucial to assess the predictive ability of forecasters, or to compare and rank competing forecasting methods. Atmospheric scientists talk of forecast verification when they refer to this process (Jolliffe and Stephenson 2003), and much of the underlying methodology has been developed by meteorologists. There is also a relevant strand of work in the econometrics literature (Diebold and Mariano 1995; Christoffersen 1998; Diebold, Gunther and Tay 1998). Murphy and Winkler (1987) proposed a general framework for the evaluation of point forecasts that uses a diagnostic approach based on graphical displays, summary measures and scoring rules. In this paper, we consider probabilistic forecasts (as opposed to point forecasts) of continuous and mixed discrete-continuous variables, such as temperature, wind speed, precipitation, gross domestic product, inflation rates and portfolio values. In this situation, probabilistic forecasts take the form of predictive densities or predictive cumulative distribution functions, and the diagnostic approach faces a challenge, in that the forecasts take the form of probability distributions while the observations are real-valued.

We consider a simple game-theoretic framework for the evaluation of predictive performance. At times $t = 1, 2, \ldots$, nature chooses a distribution, $G_t$, which we think of as the true data generating process, and the forecaster chooses a probabilistic forecast in the form of a predictive cumulative distribution function, $F_t$. The observation, $x_t$, is a random number with distribution $G_t$. If

\[ F_t = G_t \quad \text{for all } t, \qquad (1) \]

we talk of a perfect forecaster. In practice, the true distribution, $G_t$, remains hypothetical, and the predictive distribution, $F_t$, is understood as an expert opinion which may or may not derive from a statistical prediction algorithm. In accordance with Dawid's (1984) prequential principle, the predictive distributions need to be assessed on the basis of the forecast-observation pairs $(F_t, x_t)$ only, irrespective of their origins. Dawid (1984) and Diebold, Gunther and Tay (1998) proposed the use of the probability integral transform or PIT value,


\[ p_t = F_t(x_t), \qquad (2) \]

for doing this. If the forecaster is perfect and $F_t$ is continuous, then $p_t$ has a uniform distribution. Hence, the uniformity of the probability integral transform is a necessary condition for the forecaster to be perfect, and checks for its uniformity have formed a cornerstone of forecast evaluation, particularly in econometrics and meteorology. In the classical time series framework, each $F_t$ corresponds to a one-step ahead forecast, and checks for the uniformity of the probability integral transform have been supplemented by checks for its independence (Frühwirth-Schnatter 1996; Diebold et al. 1998).

Table 1: Scenario for the simulation study. At times $t = 1, 2, \ldots$, nature chooses a distribution, $G_t$, the forecaster chooses a probabilistic forecast, $F_t$, and the observation is a random number, $x_t$, with distribution $G_t$. We write $\mathcal{N}(\mu, \sigma^2)$ for the normal distribution with mean $\mu$ and variance $\sigma^2$, and we identify distributions and cumulative distribution functions, respectively. The sequences $(\tau_t)_{t=1,2,\ldots}$ and $(\delta_t, \sigma_t^2)_{t=1,2,\ldots}$ are independent identically distributed and independent of each other.

Nature                      $G_t = \mathcal{N}(\mu_t, 1)$ where $\mu_t \sim \mathcal{N}(0, 1)$
Perfect forecaster          $F_t = \mathcal{N}(\mu_t, 1)$
Climatological forecaster   $F_t = \mathcal{N}(0, 2)$
Unfocused forecaster        $F_t = \frac{1}{2} \left( \mathcal{N}(\mu_t, 1) + \mathcal{N}(\mu_t + \tau_t, 1) \right)$ where $\tau_t = \pm 1$ with probability $\frac{1}{2}$ each
Hamill's forecaster         $F_t = \mathcal{N}(\mu_t + \delta_t, \sigma_t^2)$ where $(\delta_t, \sigma_t^2) = (\frac{1}{2}, 1)$, $(-\frac{1}{2}, 1)$ or $(0, \frac{169}{100})$ with probability $\frac{1}{3}$ each

Hamill (2001) gave a thought-provoking example of a forecaster for whom the histogram of the PIT values is essentially uniform, even though every single probabilistic forecast is biased. His example aimed to show that the uniformity of the PIT values is a necessary but not a sufficient condition for the forecaster to be perfect. To fix the idea, we consider a simulation study based on the scenario described in Table 1. At times $t = 1, 2, \ldots$, nature chooses the distribution $G_t = \mathcal{N}(\mu_t, 1)$ where $\mu_t$ is standard normal. In the context of weather forecasts, we might think of $\mu_t$ as an accurate description of the latest observable state of the atmosphere. The perfect forecaster is an expert meteorologist who conditions on the current state, $\mu_t$, and issues a perfect probabilistic forecast, $F_t = G_t$. The climatological forecaster takes the unconditional distribution, $F_t = \mathcal{N}(0, 2)$, as probabilistic forecast. The unfocused forecaster observes the current state, $\mu_t$, but adds a mixture component to the forecast, which can be interpreted as distributional bias. A similar comment applies to Hamill's forecaster. Clearly, our forecasters are caricatures of operational weather forecasters; yet, climatological reference forecasts and conditional biases are frequently observed in practice. For simplicity, we assume that the states, $\mu_t$, are independent. Extensions to serially dependent states are straightforward and will be discussed below. The observation, $x_t$, is a random draw from $G_t$, and we repeat the prediction experiment 10000 times. Figure 1 shows that the PIT histograms for the four forecasters are essentially uniform; furthermore, the PIT values are independent, and this remains true under serially dependent states, with the single exception of the climatological forecaster. The respective sample autocorrelation functions are illustrated in Figures 2 and 3.
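The simulation experiment can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind the figures: the function names `simulate_pit` and `pit_histogram` are hypothetical, the seed is arbitrary, and Hamill's forecaster is assumed to use the shifts $\pm\frac{1}{2}$ with unit variance and the variance $\frac{169}{100}$ (standard deviation 1.3) with zero shift.

```python
import random
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cumulative distribution function


def simulate_pit(n=10000, seed=1):
    """Return PIT values p_t = F_t(x_t) for the four forecasters of Table 1."""
    rng = random.Random(seed)
    pit = {"perfect": [], "climatological": [], "unfocused": [], "hamill": []}
    for _ in range(n):
        mu = rng.gauss(0.0, 1.0)   # state mu_t ~ N(0, 1)
        x = rng.gauss(mu, 1.0)     # observation x_t ~ G_t = N(mu_t, 1)
        # Perfect forecaster: F_t = N(mu_t, 1)
        pit["perfect"].append(PHI(x - mu))
        # Climatological forecaster: F_t = N(0, 2)
        pit["climatological"].append(PHI(x / 2 ** 0.5))
        # Unfocused forecaster: equal mixture of N(mu_t, 1) and N(mu_t + tau_t, 1)
        tau = rng.choice((-1.0, 1.0))
        pit["unfocused"].append(0.5 * (PHI(x - mu) + PHI(x - mu - tau)))
        # Hamill's forecaster (assumed parameters, see the lead-in)
        delta, sd = rng.choice(((0.5, 1.0), (-0.5, 1.0), (0.0, 1.3)))
        pit["hamill"].append(PHI((x - mu - delta) / sd))
    return pit


def pit_histogram(values, bins=20):
    """Relative frequencies of PIT values in equally wide bins on [0, 1]."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return [c / len(values) for c in counts]


if __name__ == "__main__":
    for name, values in simulate_pit().items():
        freqs = pit_histogram(values)
        print(f"{name:15s} min bin={min(freqs):.3f} max bin={max(freqs):.3f}")
```

With 20 bins a uniform histogram has relative frequency 0.05 in each bin, so the printed minimum and maximum give a quick numerical impression of how flat each histogram is.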

In  view  of  the  reliance  on  the  probability  integral  transform  in  the  extant  literature,  this  is 
a  disconcerting  result.  As  Diebold,  Gunther  and  Tay  (1998)  pointed  out,  the  perfect  forecaster 
is  preferred  by  all  users,  regardless  of  the  respective  loss  function.  Yet,  the  probability  integral 
transform  is  unable  to  distinguish  between  the  perfect  forecaster  and  her  competitors.  To  address 
these  limitations,  we  propose  a  diagnostic  approach  to  the  evaluation  of  predictive  performance 
that  is  based  on  the  paradigm  of  maximizing  the  sharpness  of  the  predictive  distributions  subject 
to  calibration.  Calibration  refers  to  the  statistical  consistency  between  the  distributional  forecasts 
and  the  observations,  and  is  a  joint  property  of  the  predictions  and  the  observed  values.  Sharpness 
refers  to  the  concentration  of  the  predictive  distributions  and  is  a  property  of  the  forecasts  only. 
The  more  concentrated  the  predictive  distributions,  the  sharper  the  forecasts,  and  the  sharper  the 
better,  subject  to  calibration. 

The remainder of the paper is organized as follows. Section 2 develops our game-theoretic framework for the assessment of predictive performance. We introduce the notions of probabilistic calibration, exceedance calibration and marginal calibration, give examples and counterexamples, and discuss a conjectured sharpness principle. In Section 3, we propose diagnostic tools such as marginal calibration plots and sharpness diagrams that complement the PIT histogram. Proper scoring rules address calibration as well as sharpness and allow for the ranking of competing forecast procedures. Section 4 turns to a case study on probabilistic forecasts at the Stateline wind energy center in the US Pacific Northwest. The diagnostic approach yields a clear-cut ranking of statistical algorithms for forecasts of wind speed, and suggests forecast improvements that can be addressed in future research. Similar approaches hold considerable promise as very general, nonparametric tools for statistical model selection and model diagnostics. The paper closes with a discussion in Section 5 that emphasizes the need for routine assessments of sharpness in the evaluation of predictive performance.

Figure 1: Probability integral transform (PIT) histograms. [Panel titles recoverable from the source: Perfect Forecaster, Climatological Forecaster; horizontal axis: Probability Integral Transform.]

2  Modes  of  calibration 

We consider probabilistic forecasting as a game played between nature and the forecaster. At times or instances $t = 1, 2, \ldots$, nature chooses a distribution, $G_t$, and the forecaster chooses a probabilistic forecast in the form of a predictive cumulative distribution function, $F_t$. The observation, $x_t$, is a random number with distribution $G_t$. For simplicity, we suppose that $F_t$ and $G_t$ are continuous and strictly increasing on $\mathbb{R}$. In this framework, calibration refers to the asymptotic compatibility of the sequences $(G_t)_{t=1,2,\ldots}$ and $(F_t)_{t=1,2,\ldots}$, which correspond to the data generating mechanism and to the forecasts, respectively. Our approach seems slightly broader than Dawid's (1984) prequential framework, since we think of $(F_t)_{t=1,2,\ldots}$ as a general, countable sequence of forecasts, with the index referring to time, space or subjects, depending on the prediction problem at hand.

Figure 2: Sample autocorrelation functions for the probability integral transform.

Figure 3: Same as Figure 2 except that the states, $\mu_t$, are now serially dependent, following a stationary Gaussian autoregression of order 1 (the value of the autoregressive parameter is illegible in the source).

2.1  Probabilistic  calibration,  exceedance  calibration  and  marginal  calibration 

Henceforth, $(F_t)_{t=1,2,\ldots}$ and $(G_t)_{t=1,2,\ldots}$ denote sequences of continuous and strictly increasing cumulative distribution functions, possibly depending on stochastic parameters. We think of $(G_t)_{t=1,2,\ldots}$ as the true data generating process and of $(F_t)_{t=1,2,\ldots}$ as the associated sequence of probabilistic forecasts. The following definition refers to the asymptotic compatibility between the data generating process and the predictive distributions in terms of three major modes of calibration. Given that $(F_t)_{t=1,2,\ldots}$ and $(G_t)_{t=1,2,\ldots}$ might depend on stochastic parameters, convergence is understood as almost sure convergence and is denoted by an arrow.

Definition 1 (modes of calibration)

(a) The sequence $(F_t)_{t=1,2,\ldots}$ is probabilistically calibrated relative to the sequence $(G_t)_{t=1,2,\ldots}$ if

\[ \frac{1}{T} \sum_{t=1}^{T} G_t \circ F_t^{-1}(p) \longrightarrow p \quad \text{for all } p \in (0, 1). \qquad (3) \]

(b) The sequence $(F_t)_{t=1,2,\ldots}$ is exceedance calibrated relative to $(G_t)_{t=1,2,\ldots}$ if

\[ \frac{1}{T} \sum_{t=1}^{T} G_t^{-1} \circ F_t(x) \longrightarrow x \quad \text{for all } x \in \mathbb{R}. \qquad (4) \]

(c) The sequence $(F_t)_{t=1,2,\ldots}$ is marginally calibrated relative to $(G_t)_{t=1,2,\ldots}$ if the limits $G(x) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} G_t(x)$ and $F(x) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} F_t(x)$ exist and equal each other for all $x \in \mathbb{R}$, and if the common limit distribution places all mass on finite values.

(d) The sequence $(F_t)_{t=1,2,\ldots}$ is strongly calibrated relative to $(G_t)_{t=1,2,\ldots}$ if it is probabilistically calibrated, exceedance calibrated and marginally calibrated.
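For finite sequences of normal distributions, the averages that appear in the definition can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, and it assumes that every $F_t$ and $G_t$ is a `statistics.NormalDist` object, which covers the normal examples of this paper but not mixtures.

```python
from statistics import NormalDist


def probabilistic_average(F, G, p):
    """(1/T) * sum_t G_t(F_t^{-1}(p)), the left-hand side of condition (3)."""
    return sum(g.cdf(f.inv_cdf(p)) for f, g in zip(F, G)) / len(F)


def exceedance_average(F, G, x):
    """(1/T) * sum_t G_t^{-1}(F_t(x)), the left-hand side of condition (4).

    inv_cdf raises an error if F_t(x) rounds to exactly 0 or 1, so x should
    not lie in the extreme tails of any F_t.
    """
    return sum(g.inv_cdf(f.cdf(x)) for f, g in zip(F, G)) / len(F)


def marginal_gap(F, G, x):
    """Average forecast CDF minus average data generating CDF at x."""
    T = len(F)
    return sum(f.cdf(x) for f in F) / T - sum(g.cdf(x) for g in G) / T


# A perfect forecaster: F_t = G_t, here with means cycling through 0, 1, 2.
G = [NormalDist(t % 3, 1.0) for t in range(300)]
F = list(G)

# A forecaster whose predictive means are all shifted upwards by one unit.
F_biased = [NormalDist(g.mean + 1.0, 1.0) for g in G]

if __name__ == "__main__":
    print(probabilistic_average(F, G, 0.3), exceedance_average(F, G, 1.2))
    print(probabilistic_average(F_biased, G, 0.5))
```

For the perfect forecaster the two averages return $p$ and $x$ up to rounding error, while the shifted forecaster returns $\Phi(1)$ instead of $\frac{1}{2}$ at $p = \frac{1}{2}$, which violates the finite-sample analogue of (3).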

If each subsequence of $(F_t)_{t=1,2,\ldots}$ is probabilistically calibrated relative to the associated subsequence of $(G_t)_{t=1,2,\ldots}$, we talk of complete probabilistic calibration. Similarly, we define completeness for exceedance calibration, marginal calibration and strong calibration. In the examples below, calibration will generally be complete. Probabilistic calibration is essentially equivalent to the uniformity of the probability integral transform. Exceedance calibration is defined in terms of thresholds, and marginal calibration requires that the limit distributions $G$ and $F$ exist and equal each other. The existence of $G$ is a natural assumption in meteorological problems and corresponds to the existence of a stable climate. Hence, marginal calibration can be interpreted in terms of the equality of actual climatology and forecast climatology.

Various  authors  have  studied  calibration  in  the  context  of  probability  forecasts  for  sequences 
of  binary  events  (Dawid  1982,  1985a,  1985b;  Oakes  1985;  Schervish  1985,  1989).  The  progress 
is  impressive  and  culminates  in  the  paper  by  Foster  and  Vohra  (1998),  who  viewed  the  prediction
problem  as  a  game  played  against  nature  as  well.  Krzysztofowicz  (1999)  discussed  calibration  in  the 
context  of  Bayesian  forecasting  systems,  and  Krzysztofowicz  and  Sigrest  (1999)  studied  calibration 
for  quantile  forecasts  of  quantitative  precipitation.  However,  we  are  unaware  of  any  prior  discussion 
of  notions  of  calibration  for  probabilistic  forecasts  of  continuous  variables. 


2.2  Examples 

The examples in this section illustrate the aforementioned modes of calibration and discuss some of the forecasters in our initial simulation study. Unless noted otherwise, $(\mu_t)_{t=1,2,\ldots}$, $(\,\ldots)_{t=1,2,\ldots}$ and $(\tau_t)_{t=1,2,\ldots}$ denote independent sequences of independent identically distributed random variables. We write $\mathcal{N}(\mu, \sigma^2)$ for the normal distribution with mean $\mu$ and variance $\sigma^2$, identify distributions and cumulative distribution functions, respectively, and let $\Phi$ denote the standard normal cumulative.


Example 1 (climatological forecaster)

$G_t = \mathcal{N}(\mu_t, 1)$ where $\mu_t \sim \mathcal{N}(0, 1)$

$F_t = \mathcal{N}(0, 2)$ for all $t$

The climatological forecaster is probabilistically calibrated and marginally calibrated, but not exceedance calibrated. The claim for marginal calibration is obvious. Putting $p = F_t(x)$ in (3), we see that probabilistic calibration holds, too. However,

\[ \frac{1}{T} \sum_{t=1}^{T} G_t^{-1} \circ F_t(x) = \frac{1}{T} \sum_{t=1}^{T} \left[ \Phi^{-1}\!\left( \Phi\!\left( \frac{x}{\sqrt{2}} \right) \right) + \mu_t \right] \longrightarrow \frac{x}{\sqrt{2}} \]

for $x \in \mathbb{R}$, in violation of exceedance calibration.
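The step behind the claim of probabilistic calibration for the climatological forecaster can be written out as follows. This is a sketch under the notation of the example, with $\mu_t$ the mean of $G_t$ and $Z$ and $\mu$ independent standard normal random variables introduced here for illustration. Since $F_t^{-1}(p) = \sqrt{2}\, \Phi^{-1}(p)$ and $G_t(y) = \Phi(y - \mu_t)$,

\[ \frac{1}{T} \sum_{t=1}^{T} G_t \circ F_t^{-1}(p) = \frac{1}{T} \sum_{t=1}^{T} \Phi\!\left( \sqrt{2}\, \Phi^{-1}(p) - \mu_t \right) \longrightarrow \mathrm{E}\, \Phi\!\left( \sqrt{2}\, \Phi^{-1}(p) - \mu \right) = \Pr\!\left( Z + \mu \le \sqrt{2}\, \Phi^{-1}(p) \right) = \Phi\!\left( \Phi^{-1}(p) \right) = p, \]

where the second to last equality uses that $Z + \mu$ has distribution $\mathcal{N}(0, 2)$.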


The characteristic property in Example 1 is that the predictive distributions, $F_t$, all equal nature's limiting distribution, $G$. We call any forecaster with this property a climatological forecaster. For climatological forecasts, probabilistic calibration is essentially equivalent to marginal calibration. Indeed, if $G$ is continuous and strictly increasing, then putting $p = F_t(x) = G(x)$ in (3) recovers the marginal calibration condition. In practice, climatological forecasts are constructed from historical records of the observations, and they are often used as reference forecasts.


Example 2 (unfocused forecaster)

$G_t = \mathcal{N}(\mu_t, 1)$ where $\mu_t \sim \mathcal{N}(0, 1)$

$F_t = \frac{1}{2} \left( \mathcal{N}(\mu_t, 1) + \mathcal{N}(\mu_t + \tau_t, 1) \right)$ where $\mathrm{pr}(\tau_t = \pm 1) = \frac{1}{2}$

The unfocused forecaster is probabilistically calibrated relative to $(G_t)_{t=1,2,\ldots}$, but neither exceedance calibrated nor marginally calibrated. To prove the claim for probabilistic calibration, put $\Phi_{\pm}(x) = \frac{1}{2} \left( \Phi(x) + \Phi(x \mp 1) \right)$ and note that

\[ \frac{1}{T} \sum_{t=1}^{T} G_t \circ F_t^{-1}(p) \longrightarrow \frac{1}{2} \left[ \Phi \circ \Phi_{+}^{-1}(p) + \Phi \circ \Phi_{-}^{-1}(p) \right] = p, \]

where the equality follows upon putting $p = \Phi_{+}(x)$, substituting and simplifying. Exceedance calibration does not hold, because

\[ \frac{1}{T} \sum_{t=1}^{T} G_t^{-1} \circ F_t(x) \longrightarrow \frac{1}{2} \left[ \Phi^{-1} \circ \Phi_{+}(x) + \Phi^{-1} \circ \Phi_{-}(x) \right] \neq x \]

in general. The marginal calibration condition is violated, because nature's limit distribution, $G = \mathcal{N}(0, 2)$, does not equal $F = \frac{1}{2} \mathcal{N}(0, 2) + \frac{1}{4} \mathcal{N}(-1, 2) + \frac{1}{4} \mathcal{N}(1, 2)$.
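The identity used for the unfocused forecaster, namely that the average of $\Phi \circ \Phi_{+}^{-1}(p)$ and $\Phi \circ \Phi_{-}^{-1}(p)$ equals $p$, can be checked numerically. The sketch below assumes $\Phi_{\pm}(x) = \frac{1}{2}(\Phi(x) + \Phi(x \mp 1))$ and inverts the mixture distribution functions by bisection; the helper names `phi_pm`, `invert` and `limit_average` are hypothetical.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cumulative distribution function


def phi_pm(x, sign):
    """Phi_+ (sign=+1) or Phi_- (sign=-1): 0.5 * (Phi(x) + Phi(x - sign))."""
    return 0.5 * (PHI(x) + PHI(x - sign))


def invert(cdf, p, lo=-40.0, hi=40.0, iters=200):
    """Invert a continuous, strictly increasing CDF by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def limit_average(p):
    """0.5 * [Phi(Phi_+^{-1}(p)) + Phi(Phi_-^{-1}(p))]."""
    x_plus = invert(lambda x: phi_pm(x, +1), p)
    x_minus = invert(lambda x: phi_pm(x, -1), p)
    return 0.5 * (PHI(x_plus) + PHI(x_minus))


if __name__ == "__main__":
    for p in (0.01, 0.1, 0.5, 0.9, 0.99):
        print(p, limit_average(p))
```

The bisection makes no use of the closed-form relation $\Phi_{-}^{-1}(p) = \Phi_{+}^{-1}(p) - 1$, so the agreement with $p$ is an independent check of the substitution argument.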

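Example 3 (sign-biased forecaster)

$G_t = \mathcal{N}(\tau_t, 1)$ where $\mathrm{pr}(\tau_t = \pm 1) = \frac{1}{2}$

$F_t = \mathcal{N}(-\tau_t, 1)$

The sign-biased forecaster is exceedance calibrated and marginally calibrated, but not probabilistically calibrated. Specifically,

\[ \frac{1}{T} \sum_{t=1}^{T} G_t \circ F_t^{-1}(p) \longrightarrow \frac{1}{2} \left[ \Phi\!\left( \Phi^{-1}(p) - 2 \right) + \Phi\!\left( \Phi^{-1}(p) + 2 \right) \right] \neq p \]

in general. However,

\[ \frac{1}{T} \sum_{t=1}^{T} G_t^{-1} \circ F_t(x) \longrightarrow \frac{1}{2} \left[ (x + 2) + (x - 2) \right] = x \]

for $x \in \mathbb{R}$. The claim for marginal calibration is obvious.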

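Example 4 (mean-biased forecaster)

$G_t = \mathcal{N}(\mu_t, 1)$ where $\mu_t \sim \mathcal{N}(0, 1)$

$F_t = \mathcal{N}(\mu_t + \tau_t, 1)$ where $\mathrm{pr}(\tau_t = \pm 1) = \frac{1}{2}$

The mean-biased forecaster is exceedance calibrated but neither probabilistically calibrated nor marginally calibrated. Specifically,

\[ \frac{1}{T} \sum_{t=1}^{T} G_t \circ F_t^{-1}(p) \longrightarrow \frac{1}{2} \left[ \Phi\!\left( \Phi^{-1}(p) - 1 \right) + \Phi\!\left( \Phi^{-1}(p) + 1 \right) \right] \neq p \]

in general, while

\[ \frac{1}{T} \sum_{t=1}^{T} G_t^{-1} \circ F_t(x) \longrightarrow \frac{1}{2} \left[ (x + 1) + (x - 1) \right] = x \]

for $x \in \mathbb{R}$. The marginal calibration condition does not hold, because nature's limit distribution, $G = \mathcal{N}(0, 2)$, differs from $F = \frac{1}{2} \left( \mathcal{N}(-1, 2) + \mathcal{N}(1, 2) \right)$.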

We  now  return  to  the  climatological  forecaster  in  Example  1,  with  the  roles  of  nature  and  the 
forecaster  interchanged. 
Table 2: The three major modes of calibration are logically independent of each other and may occur in any combination. For instance, the sign-biased forecaster in Example 3 is exceedance calibrated (E) and marginally calibrated (M) but not probabilistically calibrated (P).

P     E     M     Example
yes   yes   yes   $F_t = G_t = \mathcal{N}(0, 1)$
yes   yes   no    $F_t = G_t = \mathcal{N}(t, 1)$
yes   no    yes   Example 1 (climatological forecaster)
yes   no    no    Example 2 (unfocused forecaster)
no    yes   yes   Example 3 (sign-biased forecaster)
no    yes   no    Example 4 (mean-biased forecaster)
no    no    yes   Example 5 (reverse climatological forecaster)
no    no    no    $G_t = \mathcal{N}(t, 1)$; $F_t = \mathcal{N}(-t, 1)$

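Example 5 (reverse climatological forecaster)

$G_t = \mathcal{N}(0, 2)$ for all $t$

$F_t = \mathcal{N}(\mu_t, 1)$ where $\mu_t \sim \mathcal{N}(0, 1)$

The reverse climatological forecaster is marginally calibrated, but neither probabilistically calibrated nor exceedance calibrated. To prove the claim for probabilistic calibration, let $Z$ be a standard normal random variable and note that

\[ \frac{1}{T} \sum_{t=1}^{T} G_t \circ F_t^{-1}(p) = \frac{1}{T} \sum_{t=1}^{T} \Phi\!\left( \frac{\Phi^{-1}(p) + \mu_t}{\sqrt{2}} \right) \longrightarrow \mathrm{E}\, \Phi\!\left( \frac{\Phi^{-1}(p) + Z}{\sqrt{2}} \right) = \int \varphi(z)\, \Phi\!\left( \frac{\Phi^{-1}(p) - z}{\sqrt{2}} \right) dz \neq p \]

in general. Exceedance calibration does not hold, because

\[ \frac{1}{T} \sum_{t=1}^{T} G_t^{-1} \circ F_t(x) = \frac{1}{T} \sum_{t=1}^{T} \sqrt{2}\, (x - \mu_t) \longrightarrow \sqrt{2}\, x \]

for $x \in \mathbb{R}$. The claim for marginal calibration is obvious.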
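The limit in the probabilistic calibration argument for the reverse climatological forecaster has a closed form, which makes the failure of (3) explicit. As a sketch, let $W$ be a standard normal random variable independent of $Z$ (an auxiliary variable introduced here) and write $a = \Phi^{-1}(p)$. Then

\[ \mathrm{E}\, \Phi\!\left( \frac{a + Z}{\sqrt{2}} \right) = \Pr\!\left( W \le \frac{a + Z}{\sqrt{2}} \right) = \Pr\!\left( \sqrt{2}\, W - Z \le a \right) = \Phi\!\left( \frac{a}{\sqrt{3}} \right), \]

because $\sqrt{2}\, W - Z$ has distribution $\mathcal{N}(0, 3)$. The right-hand side equals $p = \Phi(a)$ only if $a = 0$, that is, only if $p = \frac{1}{2}$.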


The  examples  in  this  section  show  that  probabilistic  calibration,  exceedance  calibration  and 
marginal  calibration  are  logically  independent  of  each  other  and  may  occur  in  any  combination. 
Table  2  summarizes  the  respective  results. 


2.3  Hamill’s  forecaster 

We add a discussion of Hamill's forecaster. As described in Table 1, Hamill's forecaster is a master forecaster who assigns the prediction task with equal probability to any of three student forecasters, each of whom is biased. In response to nature's choice, $G_t = \mathcal{N}(\mu_t, 1)$, the student forecasters issue the predictive distributions $F_t = \mathcal{N}(\mu_t - \frac{1}{2}, 1)$, $F_t = \mathcal{N}(\mu_t + \frac{1}{2}, 1)$ and $F_t = \mathcal{N}(\mu_t, \frac{169}{100})$, respectively. For Hamill's forecaster,

\[ \frac{1}{T} \sum_{t=1}^{T} G_t \circ F_t^{-1}(p) \longrightarrow \frac{1}{3} \left[ \Phi\!\left( \Phi^{-1}(p) - \tfrac{1}{2} \right) + \Phi\!\left( \Phi^{-1}(p) + \tfrac{1}{2} \right) + \Phi\!\left( \tfrac{13}{10}\, \Phi^{-1}(p) \right) \right] = p + \epsilon(p) \]

where $|\epsilon(p)| < 0.0032$ for all $p$ but $\epsilon(p) \neq 0$ in general. The probabilistic calibration condition (3) is violated, but only slightly so, resulting in deceptively uniform histograms of the probability integral transforms. As for exceedance calibration, note that


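\[ \frac{1}{T} \sum_{t=1}^{T} G_t^{-1} \circ F_t(x) \longrightarrow \frac{1}{3} \left[ \left( x + \tfrac{1}{2} \right) + \left( x - \tfrac{1}{2} \right) + \tfrac{10}{13}\, x \right] = \tfrac{12}{13}\, x \]

for $x \in \mathbb{R}$. Hence, Hamill's forecaster is not exceedance calibrated either, nor marginally calibrated, given that $G = \mathcal{N}(0, 2)$ while $F = \frac{1}{3} \left( \mathcal{N}(-\frac{1}{2}, 2) + \mathcal{N}(\frac{1}{2}, 2) + \mathcal{N}(0, \frac{269}{100}) \right)$.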
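The size of the deviation $\epsilon(p)$ for Hamill's forecaster can be examined numerically. The sketch below assumes that the three student forecasters shift the mean by $-\frac{1}{2}$ and $+\frac{1}{2}$ with unit variance, or inflate the standard deviation to 1.3 with no shift; the function names and the grid of 9999 probability levels are arbitrary choices for illustration.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def hamill_limit(p):
    """Limit of (1/T) * sum_t G_t(F_t^{-1}(p)) for Hamill's forecaster."""
    z = STD.inv_cdf(p)
    return (STD.cdf(z - 0.5) + STD.cdf(z + 0.5) + STD.cdf(1.3 * z)) / 3.0


def epsilon(p):
    """Deviation from probabilistic calibration at level p."""
    return hamill_limit(p) - p


# Evaluate the deviation on a fine grid of probability levels in (0, 1).
grid = [k / 10000 for k in range(1, 10000)]
max_abs_eps = max(abs(epsilon(p)) for p in grid)

if __name__ == "__main__":
    print(f"max |epsilon(p)| on the grid: {max_abs_eps:.5f}")
```

The deviation is antisymmetric about $p = \frac{1}{2}$, where it vanishes, so the PIT histogram departs from uniformity by amounts that are hard to see at typical sample sizes.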


2.4  Sharpness  principle 

Ideally,  probabilistic  forecasts  aim  to  honor  the  data  generating  process,  resulting  in  the  equality 
(1)  of  nature’s  proposal  distribution,  Gt,  and  the  predictive  distribution,  Ft,  that  characterizes 
the  perfect  forecaster.  Operationally,  we  adopt  the  paradigm  of  maximizing  the  sharpness  of  the 
predictive  distributions  subject  to  calibration.  Our  conjectured  sharpness  principle  contends  that 
the  two  goals  —  perfect  forecasts  and  the  maximization  of  sharpness  subject  to  calibration  —  are 
indeed  equivalent.  This  conjectured  equivalence  could  be  explained  in  two  distinct  ways.  One 
explanation  is  that  sufficiently  strong  notions  of  calibration  imply  asymptotic  equivalence  to  the 
perfect  forecaster.  We  are  unaware  of  any  strongly  calibrated  forecasts  that  are  not  minor  variants  of 
perfect  forecasts,  and  it  would  be  interesting  to  find  such  an  example,  or  to  prove  that  sequences  of 
this  type  do  not  exist.  An  alternative  and  weaker  explanation  states  that  any  sufficiently  calibrated 
forecaster  is  at  least  as  spread  out  as  the  perfect  forecaster. 

With respect to this latter explanation, none of probabilistic, exceedance or marginal calibration alone is sufficiently strong. In the examples below it will be convenient to consider a probabilistic calibration condition,

\[ \frac{1}{T} \sum_{t=1}^{T} G_t \circ F_t^{-1}(p) = p \quad \text{for all } p \in (0, 1), \qquad (5) \]

for finite sequences $(F_t)_{1 \le t \le T}$ relative to $(G_t)_{1 \le t \le T}$, and similarly for exceedance calibration and marginal calibration. Using randomization, the examples extend to countable sequences in obvious ways. Now suppose that $\sigma > 0$, $a > 1$, $0 < \lambda < 1/a$ and $T = 2$. Let $G_1$ and $G_2$ be continuous and strictly increasing distribution functions with associated densities that are symmetric about zero and have finite variance, $\mathrm{var}(G_1) = \sigma^2$ and $\mathrm{var}(G_2) = \lambda^2 \sigma^2$. If we define

\[ F_1(x) = \frac{1}{2} \left( G_1(x) + G_2\!\left( \frac{x}{a} \right) \right), \qquad F_2(x) = F_1(ax), \]

then

\[ \mathrm{var}(F_1) + \mathrm{var}(F_2) = \frac{1}{2} \left( 1 + \frac{1}{a^2} \right) \left( 1 + a^2 \lambda^2 \right) \sigma^2 < \left( 1 + \lambda^2 \right) \sigma^2 = \mathrm{var}(G_1) + \mathrm{var}(G_2), \]

even though the finite probabilistic calibration condition (5) holds. A similar example can be given for exceedance calibration. Suppose that $\sigma > 0$, $0 < a < 1$ and

\[ 0 < \lambda < a \left( \frac{3 + a}{1 + 3a} \right)^{1/2}. \]

Let $G_1$ and $G_2$ be as above and define

\[ F_1(x) = G_1\!\left( \frac{2x}{1 + a} \right), \qquad F_2(x) = G_2\!\left( \frac{2ax}{1 + a} \right). \]

Then

\[ \mathrm{var}(F_1) + \mathrm{var}(F_2) = \frac{(1 + a)^2}{4} \left( 1 + \frac{\lambda^2}{a^2} \right) \sigma^2 < \left( 1 + \lambda^2 \right) \sigma^2 = \mathrm{var}(G_1) + \mathrm{var}(G_2), \]

even though the finite exceedance calibration condition holds. Finally, the reverse climatological forecaster shows that a forecaster can be marginally calibrated yet sharper than the perfect forecaster.
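A concrete instance of the first counterexample can be checked numerically. The sketch below rests on several assumptions: $G_1$ and $G_2$ are taken to be centered normal distributions, the parameter values $\sigma = 1$, $a = 2$ and $\lambda = 0.25$ are arbitrary choices satisfying $a > 1$ and $0 < \lambda < 1/a$, and $F_1$ is read as $F_1(x) = \frac{1}{2}(G_1(x) + G_2(x/a))$ with $F_2(x) = F_1(ax)$. All names are hypothetical.

```python
from statistics import NormalDist

SIGMA, A, LAM = 1.0, 2.0, 0.25   # sigma > 0, a > 1, 0 < lambda < 1/a
G1 = NormalDist(0.0, SIGMA)          # var(G1) = sigma^2
G2 = NormalDist(0.0, LAM * SIGMA)    # var(G2) = lambda^2 * sigma^2


def F1(x):
    """Assumed form F1(x) = 0.5 * (G1(x) + G2(x / a))."""
    return 0.5 * (G1.cdf(x) + G2.cdf(x / A))


def F2(x):
    """F2(x) = F1(a * x)."""
    return F1(A * x)


def invert(cdf, p, lo=-100.0, hi=100.0, iters=200):
    """Invert a continuous, strictly increasing CDF by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def calibration_lhs(p):
    """0.5 * [G1(F1^{-1}(p)) + G2(F2^{-1}(p))], left-hand side of (5), T = 2."""
    return 0.5 * (G1.cdf(invert(F1, p)) + G2.cdf(invert(F2, p)))


# F1 is an equal mixture of two centered laws with variances sigma^2 and
# a^2 * lambda^2 * sigma^2; F2 is the law of a draw from F1 divided by a.
var_F1 = 0.5 * (SIGMA ** 2 + (A * LAM * SIGMA) ** 2)
var_F2 = var_F1 / A ** 2
var_G = SIGMA ** 2 + (LAM * SIGMA) ** 2

if __name__ == "__main__":
    print([round(calibration_lhs(p), 6) for p in (0.05, 0.3, 0.5, 0.8, 0.95)])
    print(var_F1 + var_F2, "<", var_G)
```

Under these assumptions the forecasts satisfy the finite calibration condition at every level tried, yet their total variance is smaller than that of the data generating distributions, which is the point of the counterexample.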

For climatological forecasts, however, finite probabilistic calibration and finite marginal calibration are equivalent, and a weak form of the sharpness principle holds.


Theorem 1 Suppose that $G_1, \ldots, G_T$ and $F_1 = \cdots = F_T = F$ have second moments and satisfy the finite probabilistic calibration condition (5). Then

\[ \frac{1}{T} \sum_{t=1}^{T} \mathrm{var}(F_t) = \mathrm{var}(F) \ge \frac{1}{T} \sum_{t=1}^{T} \mathrm{var}(G_t) \]

with equality if and only if $\mathrm{E}(G_1) = \cdots = \mathrm{E}(G_T)$.

The proof of Theorem 1 is given in the appendix. We are unaware of any other results in this direction; in particular, we do not know whether a non-climatological forecaster can be probabilistically calibrated and marginally calibrated yet sharper than the perfect forecaster.
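The inequality in Theorem 1 can be motivated by a variance decomposition. The following is a sketch of the idea and not necessarily the argument given in the appendix; the symbols $m_t$ and $\bar{m}$ are introduced here for illustration. With $F_1 = \cdots = F_T = F$, putting $p = F(x)$ in the finite calibration condition shows that $F$ is the equally weighted mixture of $G_1, \ldots, G_T$,

\[ F(x) = \frac{1}{T} \sum_{t=1}^{T} G_t(x). \]

Writing $m_t = \mathrm{E}(G_t)$ and $\bar{m} = \frac{1}{T} \sum_{t=1}^{T} m_t$, the variance of the mixture decomposes as

\[ \mathrm{var}(F) = \frac{1}{T} \sum_{t=1}^{T} \mathrm{var}(G_t) + \frac{1}{T} \sum_{t=1}^{T} \left( m_t - \bar{m} \right)^2 \ge \frac{1}{T} \sum_{t=1}^{T} \mathrm{var}(G_t), \]

and the second term vanishes if and only if all the means $m_t$ coincide.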


3  Diagnostic  tools 

We now discuss diagnostic tools for the evaluation of predictive performance. In accordance with Dawid's (1984) prequential principle, the assessment of probabilistic forecasts needs to be based on the predictive distributions and the observations only. Previously, we defined notions of calibration in terms of the asymptotic consistency between the probabilistic forecasts and the data generating distributions, which are unavailable in practice. However, we obtain sample versions by substituting empirical distribution functions based on the observations. In the following, this program is carried out for probabilistic calibration and marginal calibration. Probabilistic calibration is essentially equivalent to the uniformity of the probability integral transform, and marginal calibration corresponds to the equality of observed climate and forecast climate. Exceedance calibration does not allow for a sample analogue, given the ambiguities in inverting a step function. We discuss graphical displays of sharpness and propose the use of proper scoring rules, which assign numerical measures of predictive performance, address calibration as well as sharpness, and find key applications in the ranking of competing forecast procedures.

3.1  Assessing  probabilistic  calibration

The probability integral transform (PIT) is the value that the predictive cumulative distribution function attains at the observation. Specifically, if $F_t$ is the predictive distribution and $x_t$ materializes, the transform is defined as $p_t = F_t(x_t)$. The literature usually refers to Rosenblatt (1952), even though the probability integral transform can be traced back at least to Pearson (1933). The connection to probabilistic calibration is established by substituting the empirical distribution function $\mathbf{1}\{x_t < x\}$ for the data generating distribution $G_t(x)$, $x \in \mathbb{R}$, in the probabilistic calibration condition (3), and noting that the indicator functions $\mathbf{1}\{x_t < F_t^{-1}(p)\}$ and $\mathbf{1}\{p_t < p\}$ are identical. The following theorem characterizes the asymptotic uniformity of the empirical sequence of probability integral transforms in terms of probabilistic calibration. We state this result under the assumption of a *-mixing sequence of observations (Blum, Hanson and Koopmans 1963).

Theorem 2 Let $(F_t)_{t=1,2,\ldots}$ and $(G_t)_{t=1,2,\ldots}$ be sequences of continuous, strictly increasing distribution functions. Suppose that $x_t$ has distribution $G_t$ and that the $x_t$ form a *-mixing sequence of random variables. Then

\[ \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}\{p_t < p\} \longrightarrow p \quad \text{almost surely for all } p \qquad (6) \]

if and only if $(F_t)_{t=1,2,\ldots}$ is probabilistically calibrated with respect to $(G_t)_{t=1,2,\ldots}$.


The  proof  of  this  result  is  given  in  the  appendix,  and  the  equivalence  remains  valid  under 
alternative  weak  dependence  assumptions  for  the  observations.  Essentially,  the  theorem  states 
that  the  asymptotic  uniformity  of  the  PIT  histogram  is  a  necessary  and  sufficient  condition  for 
probabilistic  calibration.  Indeed,  following  the  lead  of  Dawid  (1984)  and  Diebold,  Gunther  and 
Tay  (1998),  checks  for  the  uniformity  of  the  PIT  values  have  formed  a  cornerstone  of  forecast 
evaluation. 

Uniformity is usually evaluated in an exploratory sense, and one way of doing this is by plotting
the empirical cumulative distribution function of the PIT values and comparing it to the identity
function. This approach is adequate for small sample sizes and notable departures from uniformity,
and its proponents include Staël von Holstein (1970, p. 142), Seillier-Moiseiwitsch (1993),
Hoeting (1994, p. 33), Frühwirth-Schnatter (1996), Clements and Smith (2000), Moyeed and Papritz
(2002), Wallis (2003) and Boero and Marrocu (2004). Histograms of the probability integral
transform accentuate departures from uniformity when the sample size is large and the deviations
from uniformity are small. This alternative type of display was used by Diebold, Gunther and
Tay (1998), Weigend and Shi (2000), Bauwens, Giot, Grammig and Veredas (2004) and Gneiting,
Raftery, Westveld and Goldman (2005), among others, and 10 or 20 histogram bins generally seem
adequate. Figure 1 uses 20 bins and shows the PIT histograms for the various forecasters in our
initial simulation study. The histograms are essentially uniform. Table 3 shows the empirical
coverage of the associated central 50% and 90% prediction intervals. This information is redundant,
since the empirical coverage can be read off the PIT histogram, namely as the area under the 10
and 18 central bins, respectively.
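
As a minimal sketch of the PIT diagnostics described above, the following Python code simulates Gaussian forecasts that are ideal by construction, computes the PIT values $p_t = F_t(x_t)$, bins them into 20 histogram bins, and reads the empirical coverage of the central 50% and 90% prediction intervals off the 10 and 18 central bins. The simulation design, the seed and all function names are illustrative assumptions and are not taken from the simulation study itself.

```python
import random
from statistics import NormalDist


def pit_values(forecasts, observations):
    """PIT value p_t = F_t(x_t) for each forecast/observation pair."""
    return [f.cdf(x) for f, x in zip(forecasts, observations)]


def pit_histogram(pits, bins=20):
    """Relative frequency of the PIT values in equally wide bins on [0, 1]."""
    counts = [0] * bins
    for p in pits:
        counts[min(int(p * bins), bins - 1)] += 1
    return [c / len(pits) for c in counts]


def central_coverage(hist, n_central):
    """Coverage of a central interval, read off as the mass of the central bins."""
    skip = (len(hist) - n_central) // 2
    return sum(hist[skip:len(hist) - skip])


# Hypothetical experiment: the forecaster knows the mean mu_t exactly.
rng = random.Random(1)
means = [rng.gauss(0.0, 1.0) for _ in range(10000)]
forecasts = [NormalDist(mu, 1.0) for mu in means]
observations = [rng.gauss(mu, 1.0) for mu in means]

hist = pit_histogram(pit_values(forecasts, observations))
print("50% coverage:", central_coverage(hist, 10))
print("90% coverage:", central_coverage(hist, 18))
```

For a probabilistically calibrated forecaster, each of the 20 bins should hold roughly 5% of the PIT values, so the two printed coverages should be close to 0.5 and 0.9.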

Table 3: Empirical coverage of central prediction intervals. The nominal coverages are 50% and
90%, respectively.

Interval                     50%      90%
Perfect forecaster           51.2%    90.0%
Climatological forecaster    51.3%    90.7%
Unfocused forecaster         50.1%    90.1%
Hamill’s forecaster          50.9%    89.5%

Probabilistic weather forecasts are typically based on ensemble prediction systems, which generate
a set of perturbations of the best estimate of the current state of the atmosphere, run each of
them forward in time using a numerical weather prediction model, and use the resulting set of
forecasts as a sample from the predictive distribution of future weather quantities (Palmer 2002).
The principal device for assessing the calibration of ensemble forecasts is the verification rank
histogram or Talagrand diagram, proposed independently by Anderson (1996), Hamill and Colucci (1997)
and Talagrand, Vautard and Strauss (1997), and extensively used since. To obtain a verification rank
histogram, find the rank of the observation when pooled within the ordered ensemble values and
plot the histogram of the ranks. If we identify the predictive distribution with the empirical
cumulative distribution function of the ensemble values, this technique is seen to be equivalent to
plotting a PIT histogram. A similar procedure could be drawn on fruitfully to assess samples from
posterior predictive distributions obtained by Markov chain Monte Carlo techniques. Shephard
(1994, p. 129) gave an instructive example of how this could be done.
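
The verification rank histogram can be sketched in a few lines of Python. The example below is only an illustration: the ensemble size of 9, the Gaussian toy data and the function names are assumptions, and ties between the observation and ensemble members are ignored because they have probability zero for continuous variables.

```python
import random


def verification_rank(ensemble, observation):
    """Rank of the observation when pooled with the ensemble (1 .. m + 1)."""
    return 1 + sum(1 for member in ensemble if member < observation)


def rank_histogram(ensembles, observations):
    """Counts of verification ranks; index r - 1 holds the count of rank r."""
    m = len(ensembles[0])
    counts = [0] * (m + 1)
    for ens, obs in zip(ensembles, observations):
        counts[verification_rank(ens, obs) - 1] += 1
    return counts


# Hypothetical exchangeable case: members and observation share one distribution,
# so all m + 1 ranks should be about equally frequent.
rng = random.Random(2)
ensembles = [[rng.gauss(0.0, 1.0) for _ in range(9)] for _ in range(5000)]
observations = [rng.gauss(0.0, 1.0) for _ in range(5000)]
print(rank_histogram(ensembles, observations))
```

With the same function, samples from a posterior predictive distribution could take the place of the ensemble members.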

Visual inspection of a PIT or rank histogram often provides hints to the reasons for forecast
deficiency. Hump-shaped histograms indicate overdispersed predictive distributions with prediction
intervals that are too wide on average. U-shaped histograms often correspond to predictive
distributions that are too narrow. Triangle-shaped histograms are seen when the predictive
distributions are biased. Formal tests of uniformity can also be employed and have been studied by
Anderson (1996), Talagrand, Vautard and Strauss (1997), Noceti, Smith and Hodges (2003), Wallis
(2003) and Garratt, Lee, Pesaran and Shin (2003), among others. However, the use of formal tests is
often hindered by complex dependence structures, particularly in cases in which the probability
integral transforms are spatially aggregated. Hamill (2001) gave a thoughtful discussion of the
associated issues and potential fallacies.

In the context of time series, the observations are sequential, and the predictive distributions
correspond to sequential k-step ahead forecasts. The probability integral transforms for perfect
k-step ahead forecasts are at most $(k-1)$-dependent, and this assumption can be checked empirically,
by plotting the sample autocorrelation function for the PIT values and the higher moments thereof.
This approach was applied by Diebold, Gunther and Tay (1998), Weigend and Shi (2000), Bauwens
et al. (2004) and Campbell and Diebold (2005), among others. Figures 2 and 3 show the sample
autocorrelation functions for the PIT values and the various forecasters in our initial simulation
study, for independent states, $\mu_t$, and for serially dependent states, respectively. Smith (1985),
Frühwirth-Schnatter (1996) and Berkowitz (2001) proposed an assessment of independence based
on the transformed PIT values, $\Phi^{-1}(p_t)$, which are Gaussian under the assumption of perfect
forecasts. This further transformation has obvious advantages when formal tests of independence
are employed, and seems to make little difference otherwise.
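
The independence check can be sketched as follows. The code computes sample autocorrelations of the PIT values, of their centered second moment and of the transformed values $\Phi^{-1}(p_t)$. The i.i.d. uniform toy data, the centering at the nominal mean 1/2, the clipping constant and the rough $1.96/\sqrt{T}$ band are assumptions made for illustration only.

```python
import random
from statistics import NormalDist, fmean


def sample_acf(values, max_lag):
    """Sample autocorrelations at lags 1 .. max_lag."""
    n = len(values)
    mean = fmean(values)
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    return [sum(dev[i] * dev[i + lag] for i in range(n - lag)) / denom
            for lag in range(1, max_lag + 1)]


def gaussian_scores(pits, eps=1e-12):
    """Transformed PIT values Phi^{-1}(p_t), clipped away from 0 and 1."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(min(max(p, eps), 1.0 - eps)) for p in pits]


# Hypothetical PIT values of an ideal one-step ahead forecaster: i.i.d. uniform.
rng = random.Random(3)
pits = [rng.random() for _ in range(5000)]

acf_pit = sample_acf(pits, 5)
acf_sq = sample_acf([(p - 0.5) ** 2 for p in pits], 5)  # centered second moment
acf_z = sample_acf(gaussian_scores(pits), 5)
band = 1.96 / len(pits) ** 0.5  # rough 95% band under independence
print(acf_pit, acf_sq, acf_z, band)
```

For k-step ahead forecasts, only autocorrelations at lags of k or more would be expected to vanish.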

3.2  Assessing  marginal  calibration

Marginal calibration concerns the equality of actual climate and forecast climate. To assess marginal
calibration, we propose a comparison of the empirical cumulative distribution function,

$$\hat G_T(x) = \frac{1}{T} \sum_{t=1}^{T} 1\{x_t < x\}, \quad x \in \mathbb{R}, \tag{7}$$

to the forecast climate, represented by the average predictive cumulative distribution function,

$$\bar F_T(x) = \frac{1}{T} \sum_{t=1}^{T} F_t(x), \quad x \in \mathbb{R}. \tag{8}$$

Indeed, if we substitute the indicator function $1\{x_t < x\}$ for the data generating distribution
$G_t(x)$, $x \in \mathbb{R}$, in the definition of marginal calibration, we are led to the asymptotic
equality of $\hat G_T$ and $\bar F_T$. Theorem 3 provides a rigorous version of this correspondence.
Under mild regularity conditions, marginal calibration is a necessary and sufficient condition for
the asymptotic equality of $\hat G_T$ and $\bar F_T$. The proof of this result is deferred to the
appendix.

Theorem 3 Let $(F_t)_{t=1,2,\ldots}$ and $(G_t)_{t=1,2,\ldots}$ be sequences of continuous, strictly
increasing distribution functions. Suppose that each $x_t$ has distribution $G_t$ and that the $x_t$
form a *-mixing sequence of random variables. Suppose furthermore that
$F(x) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} F_t(x)$ exists for all $x \in \mathbb{R}$ and
that the limit function is strictly increasing on $\mathbb{R}$. Then

$$\hat G_T(x) = \frac{1}{T} \sum_{t=1}^{T} 1\{x_t < x\} \longrightarrow F(x) \quad \text{almost surely for all } x \in \mathbb{R} \tag{9}$$

if and only if $(F_t)_{t=1,2,\ldots}$ is marginally calibrated with respect to $(G_t)_{t=1,2,\ldots}$.

The most obvious graphical device is a plot of $\hat G_T(x)$ and $\bar F_T(x)$ versus $x$. However,
it is often more instructive to plot the difference of the two cumulative distribution functions, as
in the left-hand side of Figure 4, which shows the difference

$$\bar F_T(x) - \hat G_T(x), \quad x \in \mathbb{R}, \tag{10}$$

for the various forecasters in our initial simulation study. We call this type of display a marginal
calibration plot. Under the hypothesis of marginal calibration, we expect minor fluctuations about
zero only, and this is indeed the case for the perfect forecaster and the climatological forecaster.
The unfocused forecaster and Hamill’s forecaster lack marginal calibration, resulting in major
excursions from zero. The same information can be visualized in terms of quantiles, as on the
right-hand side of Figure 4, which shows the difference

$$Q(\bar F_T, q) - Q(\hat G_T, q), \quad q \in (0, 1), \tag{11}$$

of the quantile functions for $\bar F_T$ and $\hat G_T$, respectively. Under the hypothesis of
marginal calibration, we again expect minor fluctuations about zero only, and this is the case for
the perfect forecaster and the climatological forecaster. The unfocused forecaster and Hamill’s
forecaster show quantile difference functions that increase from negative to positive values,
thereby indicating forecast climates that are too spread out.

Figure 4: Marginal calibration plot for the perfect forecaster (solid line), climatological forecaster
(short dashes), unfocused forecaster (dot-dashed line) and Hamill’s forecaster (long dashes). The
presentation is in terms of cumulative distribution functions (left, plotted against the threshold
value) and in terms of quantiles (right, plotted against the cumulative probability), respectively.
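
The difference (10) underlying the marginal calibration plot can be computed on a grid of threshold values. The sketch below assumes Gaussian predictive distributions and a toy simulation; the second forecaster, with predictive standard deviation 2, is a hypothetical example of a forecast climate that is too spread out. The function names and the grid are likewise assumptions.

```python
import random
from statistics import NormalDist, fmean


def empirical_cdf(observations, x):
    """Empirical CDF as in (7); ties have probability zero for continuous data."""
    return sum(1 for obs in observations if obs < x) / len(observations)


def average_predictive_cdf(forecasts, x):
    """Average predictive CDF as in (8), evaluated at x."""
    return fmean(f.cdf(x) for f in forecasts)


def marginal_calibration_curve(forecasts, observations, grid):
    """Difference (10) between forecast climate and actual climate on a grid."""
    return [average_predictive_cdf(forecasts, x) - empirical_cdf(observations, x)
            for x in grid]


rng = random.Random(4)
means = [rng.gauss(0.0, 1.0) for _ in range(5000)]
observations = [rng.gauss(mu, 1.0) for mu in means]
grid = [-4.0 + 0.5 * i for i in range(17)]

ideal = [NormalDist(mu, 1.0) for mu in means]     # matches the data generation
too_wide = [NormalDist(mu, 2.0) for mu in means]  # hypothetical overdispersion

curve_ideal = marginal_calibration_curve(ideal, observations, grid)
curve_wide = marginal_calibration_curve(too_wide, observations, grid)
print(max(abs(d) for d in curve_ideal), max(abs(d) for d in curve_wide))
```

The first curve should fluctuate closely about zero, whereas the second should show clear excursions, positive at low thresholds and negative at high thresholds.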

3.3  Assessing  sharpness 

Sharpness  refers  to  the  concentration  of  the  predictive  distributions  and  is  a  property  of  the  forecasts 
only.  The  more  concentrated  the  predictive  distributions,  the  sharper  the  forecasts,  and  the  sharper 
the  better,  subject  to  calibration.  To  assess  sharpness,  we  use  numerical  and  graphical  summaries 
of  the  width  of  the  associated  prediction  intervals.  For  instance,  Table  4  shows  the  average  width 
of  the  central  50%  and  90%  prediction  intervals  for  the  forecasters  in  our  simulation  study.  The 
perfect  forecaster  is  the  sharpest,  followed  by  Hamill’s  forecaster,  the  unfocused  forecaster  and  the 
climatological  forecaster.  A  fair  comparison  requires  that  the  empirical  coverage  of  the  prediction 
intervals  be  close  to  nominal,  which  we  showed  to  be  true  in  Figure  1  and  Table  3.  In  our  simplistic 
simulation  study,  the  width  of  the  prediction  intervals  is  fixed,  except  for  Hamill’s  forecaster,
and  the  tabulation  is  perfectly  adequate.  In  many  types  of  applications,  however,  conditional 
heteroscedasticity  leads  to  considerable  variability  in  the  width  of  the  prediction  intervals.  The 
average  width  then  is  often  insufficient  to  characterize  sharpness,  and  we  follow  Bremnes  (2004)  in 
proposing  boxplots  as  a  more  instructive  graphical  device.  The  resulting  sharpness  diagram  is  an 
important  diagnostic  tool,  and  we  present  an  example  thereof  in  Section  4  below. 
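
For Gaussian predictive distributions with fixed variance, the width of a central prediction interval follows directly from the quantile function. The following sketch assumes, since this section does not restate the simulation design, that the perfect forecaster issues unit-variance Gaussians and that the climatological forecaster issues a Gaussian with variance 2. The function name is hypothetical.

```python
from statistics import NormalDist


def central_interval_width(dist, level):
    """Width of the central prediction interval with the given nominal level."""
    tail = (1.0 - level) / 2.0
    return dist.inv_cdf(1.0 - tail) - dist.inv_cdf(tail)


# Assumed predictive distributions; the width does not depend on the mean.
perfect = NormalDist(0.0, 1.0)                # variance 1
climatological = NormalDist(0.0, 2.0 ** 0.5)  # variance 2

for name, dist in [("perfect", perfect), ("climatological", climatological)]:
    print(name,
          round(central_interval_width(dist, 0.5), 2),
          round(central_interval_width(dist, 0.9), 2))
```

Under these assumptions the rounded widths are 1.35 and 3.29 for the first distribution and 1.91 and 4.65 for the second, in agreement with the corresponding rows of Table 4.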

Table 4: Average width of central prediction intervals. The nominal coverages are 50% and 90%,
respectively.

Interval                     50%     90%
Perfect forecaster           1.35    3.29
Climatological forecaster    1.91    4.65
Unfocused forecaster         1.52    3.68
Hamill’s forecaster          1.49    3.62

Table 5: Average logarithmic score (LogS) and continuous ranked probability score (CRPS).

                             LogS    CRPS
Perfect forecaster           1.41    0.56
Climatological forecaster    1.75    0.78
Unfocused forecaster         1.53    0.63
Hamill’s forecaster          1.52    0.61

3.4  Proper  scoring  rules 

Scoring rules assign numerical scores to probabilistic forecasts and form attractive summary
measures of predictive performance, in that they address calibration and sharpness simultaneously.
We write $S(F, x)$ for the score assigned when the forecaster issues the predictive distribution
$F$ and $x$ materializes, and we take scores to be penalties that the forecaster wishes to minimize
on average. A scoring rule is proper if the expected value of the penalty $S(F, x)$ for an
observation $x$ drawn from $G$ is minimized if the forecast is perfect, that is, if $F = G$. It is
strictly proper if the minimum is unique. Propriety is a crucial characteristic of scoring rules; it
rewards perfect forecasts and discourages hedging. Winkler (1977) gave an interesting discussion of
the ways in which proper scoring rules encourage sharp forecasts.

The logarithmic score is the negative of the logarithm of the predictive density evaluated at
the observation (Good 1952; Bernardo 1979). The logarithmic score is proper and has many
desirable properties (Roulston and Smith 2002) yet lacks robustness (Selten 1998; Gneiting and
Raftery 2004). The continuous ranked probability score is defined directly in terms of the
predictive cumulative distribution function, $F$, namely as

$$\mathrm{crps}(F, x) = \int_{-\infty}^{\infty} \left( F(y) - 1\{y > x\} \right)^2 \, dy, \tag{12}$$

and provides a more robust alternative. Gneiting and Raftery (2004) gave an alternative
representation and showed that

$$\mathrm{crps}(F, x) = E_F |X - x| - \frac{1}{2} E_F |X - X'|, \tag{13}$$

where $X$ and $X'$ are independent copies of a random variable with distribution function $F$ and
finite first moment. The representation (13) shows that the continuous ranked probability score
generalizes the absolute error, to which it reduces if $F$ is a point forecast. Furthermore, it can
be reported in the same unit as the observations. The continuous ranked probability score is proper,
and we rank competing forecast procedures based on its average,

$$\mathrm{CRPS} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{crps}(F_t, x_t) = \int_{-\infty}^{\infty} \mathrm{BS}(y) \, dy, \tag{14}$$

where $\mathrm{BS}(y) = \frac{1}{T} \sum_{t=1}^{T} \left( F_t(y) - 1\{x_t < y\} \right)^2$ denotes the
Brier score (Brier 1950) for probability forecasts of the binary events at the threshold value
$y \in \mathbb{R}$. The Brier score allows for the distinction of a calibration component and a
refinement component (Murphy 1972; Blattenberger and Lad 1985), but the decomposition requires a
binning of the forecast probabilities and may not be stable if the binning is changed.

Figure 5: Brier score plot for the perfect forecaster (solid line), climatological forecaster (short
dashes), unfocused forecaster (dot-dashed line), and Hamill’s forecaster (long dashes). The graphs
show the Brier score as a function of the threshold value. The area under the associated curve
equals the CRPS value (14).
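
The two representations of the continuous ranked probability score can be compared numerically when the predictive distribution is the empirical distribution of a finite set of values. This is a sketch under that assumption; the function names, the integration limits and the midpoint rule are illustrative choices.

```python
def crps_ensemble(members, x):
    """CRPS via representation (13), with F the empirical CDF of the members."""
    m = len(members)
    term1 = sum(abs(xi - x) for xi in members) / m
    term2 = sum(abs(xi - xj) for xi in members for xj in members) / (m * m)
    return term1 - 0.5 * term2


def crps_by_integration(members, x, lo, hi, steps=20000):
    """CRPS via the integral (12), approximated by a midpoint rule on [lo, hi].

    The interval [lo, hi] must contain all members and the observation x.
    """
    m = len(members)
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * h
        cdf = sum(1 for xi in members if xi <= y) / m
        step = 1.0 if y > x else 0.0
        total += (cdf - step) ** 2 * h
    return total


print(crps_ensemble([0.0, 2.0], 1.0))                   # 0.5
print(crps_by_integration([0.0, 2.0], 1.0, -1.0, 3.0))  # approximately 0.5
print(crps_ensemble([3.0], 1.0))                        # 2.0, the absolute error
```

The last line illustrates the reduction to the absolute error for a point forecast. Averaging `crps_ensemble` over all forecast cases gives the first expression in (14).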

Table 5 shows the logarithmic score and the continuous ranked probability score for the various
forecasters in our initial simulation study, when averaged over the 10000 replicates of the
prediction experiment. As expected, both scoring rules rank the perfect forecaster highest, followed
by Hamill’s forecaster, the unfocused forecaster and the climatological forecaster. Figure 5 plots
the Brier score for the associated binary forecasts as a function of the threshold value, thereby
illustrating the integral representation on the right-hand side of (14). This type of display was
proposed by Gerds (2002, Section 2.3) and Schumacher, Graf and Gerds (2003), who called the graphs
prediction error curves.
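
As a rough plausibility check on the first row of Table 5, the expected scores of a Gaussian forecaster whose predictive distribution coincides with the data generating distribution can be worked out in closed form. The derivation assumes, hypothetically, that the perfect forecaster issues $F = G = N(\mu, \sigma^2)$ with $\sigma = 1$ and that the logarithmic score uses the natural logarithm; neither assumption is restated in this section.

```latex
% Assumptions: F = G = N(\mu, \sigma^2); natural logarithm; x drawn from G.
\begin{align*}
E_G\, \mathrm{LogS}(F, x)
  &= E_G\!\left[ \tfrac{1}{2} \log(2\pi\sigma^2) + \frac{(x - \mu)^2}{2\sigma^2} \right]
   = \tfrac{1}{2} \log(2\pi\sigma^2) + \tfrac{1}{2}, \\
E_G\, \mathrm{crps}(F, x)
  &= E|X - x| - \tfrac{1}{2}\, E|X - X'|
   = \frac{2\sigma}{\sqrt{\pi}} - \frac{\sigma}{\sqrt{\pi}}
   = \frac{\sigma}{\sqrt{\pi}}.
\end{align*}
% Here X, X' and x are independent N(\mu, \sigma^2) variables, so that each
% difference is N(0, 2\sigma^2) with mean absolute value 2\sigma / \sqrt{\pi}.
% For \sigma = 1 the two expectations are approximately 1.42 and 0.56.
```

Under these assumptions the expected values are close to the simulated averages of 1.41 and 0.56 reported for the perfect forecaster.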

4  Case  study:  Probabilistic  forecasts  at  the  Stateline  wind  energy  center

Wind  power  is  the  fastest-growing  energy  source  today.  Estimates  are  that  within  the  next  15 
years  wind  energy  will  fill  about  6%  of  the  electricity  supply  in  the  United  States.  In  Denmark, 
wind  energy  already  meets  20%  of  the  country’s  total  energy  needs.  However,  arguments  against 
the  proliferation  of  wind  energy  have  been  put  forth,  often  focusing  on  the  perceived  inability  to 
forecast  wind  resources  with  any  degree  of  accuracy.  The  development  of  advanced  probabilistic 
forecast  methodologies  helps  address  these  concerns. 

The  prevalent  approach  to  short-range  forecasts  of  wind  speed  and  wind  power  at  prediction 
horizons  up  to  a  few  hours  uses  on-site  observations  and  autoregressive  time  series  models  (Brown, 
Katz  and  Murphy  1984).  Gneiting,  Larson,  Westrick,  Genton  and  Aldrich  (2004)  proposed  a 
novel  spatio-temporal  approach,  the  regime-switching  space-time  or  RST  method,  that  merges 
meteorological  and  statistical  expertise  to  obtain  fully  probabilistic  forecasts  of  wind  resources. 
Henceforth,  we  illustrate  our  diagnostic  approach  to  the  evaluation  of  predictive  distributions  by  a 
comparison  and  ranking  of  three  competing  forecast  methodologies  for  two-step  ahead  predictions 
of  hourly  average  wind  speed  at  the  Stateline  wind  energy  center  in  the  US  Pacific  Northwest. 
The  evaluation  period  is  May  through  November  2003,  resulting  in  a  total  of  5136  probabilistic 
forecasts. 

4.1  Predictive  distributions  for  hourly  average  wind  speed 

We consider three competing statistical prediction algorithms for two-step ahead probabilistic
forecasts of hourly average wind speed, $w_t$, at a meteorological tower in close vicinity of the Stateline
wind  energy  center,  which  is  located  on  the  Vansycle  ridge  at  the  border  between  the  states  of 
Oregon  and  Washington.  The  data  source  is  described  in  Gneiting  et  al.  (2004). 

The  first  method  is  the  persistence  forecast,  a  naive  yet  surprisingly  skillful,  nonparametric 
reference  forecast.  The  persistence  point  forecast  is  simply  the  most  recent  observed  value  of 
hourly  average  wind  speed  at  Stateline.  To  obtain  a  predictive  distribution,  we  dress  the  point 
forecast  with  the  19  most  recent  observed  values  of  the  persistence  error,  which  corresponds  to 
a  naive  version  of  the  approach  of  Roulston  and  Smith  (2003).  Hence,  the  predictive  cumulative 
distribution function for $w_{t+2}$ is the empirical distribution function of the set

$$\{ \max(w_t - w_{t-h} + w_{t-h-2}, \, 0) : h = 0, \ldots, 18 \},$$

and  the  associated  prediction  intervals  are  readily  formed  from  the  order  statistics  of  this  set.  The 
second  technique  is  the  autoregressive  time  series  approach  which  was  proposed  by  Brown,  Katz 
and  Murphy  (1984)  and  has  found  widespread  use  since.  To  apply  this  technique,  we  fit  and  extract 
a  diurnal  trend  component  based  on  a  sliding  40-day  training  period,  fit  a  stationary  autoregression 
to  the  residual  component  and  find  a  Gaussian  predictive  distribution  in  the  customary  way.  The 
Gaussian  predictive  distribution  assigns  a  typically  small  positive  mass  to  the  negative  half-axis, 
and  in  view  of  the  nonnegativity  of  the  predictand  we  redistribute  this  mass  to  wind  speed  zero. 
The  details  are  described  in  Gneiting  et  al.  (2004),  where  the  method  is  referred  to  as  the  AR-D 
technique. 
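
The dressed persistence forecast can be sketched directly from the set given above. The code follows that formula as printed; the rule that maps a nominal level to a pair of order statistics is an assumption (it treats the 19 sorted values as splitting the line into 20 equally likely pieces), and the wind speed series and function names are hypothetical.

```python
def persistence_ensemble(w, t, members=19):
    """Sorted set {max(w[t] - w[t-h] + w[t-h-2], 0) : h = 0, ..., members - 1}."""
    if t - (members - 1) - 2 < 0:
        raise ValueError("not enough history before time t")
    return sorted(max(w[t] - w[t - h] + w[t - h - 2], 0.0)
                  for h in range(members))


def central_interval(ensemble, level):
    """Central interval from order statistics of a sorted ensemble.

    Assumes that the m sorted values split the line into m + 1 equally
    likely pieces, so that the k-th and (m + 1 - k)-th order statistics
    bound an interval with nominal coverage 1 - 2k / (m + 1).
    """
    m = len(ensemble)
    k = round((1.0 - level) / 2.0 * (m + 1))  # 1-based order statistic
    if k < 1:
        raise ValueError("level too high for this ensemble size")
    return ensemble[k - 1], ensemble[m - k]


# Hypothetical hourly average wind speeds in m/s, oldest first.
w = [6.0 + 0.5 * ((7 * i) % 5) for i in range(30)]
ens = persistence_ensemble(w, t=29)
print(central_interval(ens, 0.5))  # 5th and 15th order statistics
print(central_interval(ens, 0.9))  # smallest and largest value
```

With 19 values, the nominal 50% and 90% central intervals correspond to whole order statistics, which may be one reason for this choice of sample size.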

The third method is the regime-switching space-time (RST) approach of Gneiting et al. (2004).
The RST model is parsimonious, yet takes account of all the salient features of wind speed:
alternating atmospheric regimes, temporal and spatial autocorrelation, diurnal and seasonal
nonstationarity, conditional heteroscedasticity and non-Gaussianity. The method utilizes offsite
information from the nearby meteorological towers at Goodnoe Hills and Kennewick, identifies
atmospheric regimes at the wind energy site and fits conditional predictive models for each regime,
based on a sliding 45-day training period. Details are given in Gneiting et al. (2004), where the
method is referred to as the RST-D-CH technique. Any minor discrepancies in the performance
measures reported henceforth and in Gneiting et al. (2004) stem from the use of R versus Splus and
the associated slight differences in the optimization algorithms used for estimating the predictive
models.

Figure 6: Probability integral transform (PIT) histogram and sample autocorrelation functions
for the first three centered moments, for persistence forecasts of hourly average wind speed at the
Stateline wind energy center.

4.2  Assessing  calibration 

Figures  6,  7  and  8  show  the  probability  integral  transform  (PIT)  histograms  for  the  three  forecast 
techniques,  along  with  the  sample  autocorrelation  functions  for  the  first  three  centered  moments 
of  the  PIT  values  and  the  associated  Bartlett  confidence  intervals.  The  PIT  histograms  for  the 
persistence forecasts and for the RST forecasts appear uniform. The histogram for the autoregressive
forecasts is hump-shaped, thereby suggesting departures from probabilistic calibration. Table
6 shows the associated empirical coverage of the 50% and 90% central prediction intervals.

Figure 7: Same as Figure 6, but for autoregressive forecasts.


The  PIT  values  for  perfect  2-step  ahead  forecasts  are  at  most  1-dependent,  and  the  sample 
autocorrelation  functions  for  the  RST  forecasts  seem  compatible  with  this  assumption.  The  sample 
autocorrelations  for  the  persistence  forecasts  are  nonnegligible  at  lag  2,  and  the  centered  second 
moment  of  the  PIT  values  shows  notable  negative  correlations  at  lags  between  15  and  20  hours. 
These features indicate a lack of fit of the predictive model but seem hard to interpret
diagnostically. The respective sample autocorrelations for the autoregressive forecasts are positive
and nonnegligible at lags up to 5 hours, thereby pointing at conditional heteroscedasticity in the
wind speed series. Indeed, Gneiting et al. (2004) showed that the autoregressive forecasts improve
when a conditionally heteroscedastic model is employed. In the current, classical autoregressive
formulation the predictive variance varies as a result of the sliding training period, but high-frequency
changes  in  the  predictability  are  not  taken  into  account. 

Figure 9 shows marginal calibration plots for the three forecasts, both in terms of cumulative
distribution functions and in terms of quantiles. The display is in analogy to Figure 4 and the
graphs show the differences defined in (10) and (11), respectively. The graphs for all three
forecasts show nonnegligible excursions from zero, particularly at small wind speeds, and the
excursions are most pronounced for the autoregressive forecasts. The lack of predictive model fit
can be explained by a closer examination of Figure 10, which shows the empirical cumulative
distribution function, $\hat G_T$, of hourly average wind speed during the evaluation period. Hourly
average wind speeds less than 1 $\mathrm{m \cdot s^{-1}}$ were almost never observed. This is
incompatible with the mean predictive distribution functions, $\bar F_T$, for the three methods,
each of which assigns positive point mass to wind speed zero, resulting in the initial positive
excursions in the left-hand plot of Figure 9. This issue applies to all three techniques, and for
the persistence forecasts and the autoregressive forecasts it is not clear how it could be
addressed. The RST method fits cut-off normal predictive distributions that are concentrated on the
nonnegative half-axis and involve a point mass at zero. The marginal calibration plot suggests the
use of truncated normal predictive distributions as a promising alternative.

Figure 8: Same as Figure 6, but for RST forecasts.
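
The difference between the two families can be made concrete by writing down their cumulative distribution functions. This is a sketch only: the function names and parameter values are hypothetical, and it does not reproduce how the RST method estimates its parameters.

```python
from statistics import NormalDist


def cutoff_normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) with the mass below zero moved to a point mass at 0."""
    return NormalDist(mu, sigma).cdf(x) if x >= 0.0 else 0.0


def truncated_normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) conditioned on being nonnegative (no point mass)."""
    if x < 0.0:
        return 0.0
    base = NormalDist(mu, sigma)
    below_zero = base.cdf(0.0)
    return (base.cdf(x) - below_zero) / (1.0 - below_zero)


mu, sigma = 2.0, 2.5  # hypothetical predictive parameters, in m/s
print(cutoff_normal_cdf(0.0, mu, sigma))     # size of the point mass at zero
print(truncated_normal_cdf(0.0, mu, sigma))  # 0.0: no mass at zero
```

The cut-off distribution jumps at zero, whereas the truncated distribution rescales the density on the nonnegative half-axis and starts from zero, in line with the observation that very small wind speeds are almost never observed.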

4.3  Assessing  sharpness 

Sharpness concerns the concentration of the predictive distributions, and we consider the central
prediction interval at the 50% and 90% level, respectively. Table 6 shows that the empirical
coverage is close to nominal for all three techniques; hence, an assessment of the sharpness of the
predictive distributions in terms of the width of the prediction intervals is fair. The boxplots in
the sharpness diagram of Figure 11 show the 5th, 25th, 50th, 75th and 95th percentile of the width
of the interval for the 5136 predictive distributions during the evaluation period, and Table 7
shows the associated average width. The prediction intervals for the persistence forecasts vary the
most in width, followed by the RST forecasts and the autoregressive forecasts. The RST forecasts are
clearly the sharpest, with prediction intervals that are about 20% shorter on average than those
for the autoregressive forecasts.

Figure 9: Marginal calibration plot for persistence forecasts (dashed line), autoregressive forecasts
(dot-dashed line) and RST forecasts (solid line) of hourly average wind speed at the Stateline wind
energy center in terms of cumulative distribution functions (left) and in terms of quantiles (right),
respectively, in $\mathrm{m \cdot s^{-1}}$.

Figure 10: Empirical cumulative distribution function of hourly average wind speed at the Stateline
wind energy center in May through November 2003, in $\mathrm{m \cdot s^{-1}}$.

Table 6: Empirical coverage of central prediction intervals. The nominal coverages are 50% and
90%, respectively.

Interval                   50%      90%
Persistence forecast       50.9%    89.2%
Autoregressive forecast    55.6%    90.4%
RST forecast               51.2%    88.4%

4.4  Continuous  ranked  probability  score 

Table  8  shows  the  mean  continuous  ranked  probability  score  or  CRPS  value  (14)  for  the  various 
forecasts.  We  report  the  scores  month  by  month,  which  allows  for  an  assessment  of  seasonal  effects 
and  straightforward  tests  of  the  null  hypothesis  of  no  difference  in  the  predictive  performance  of 
competing  probabilistic  forecasts.  For  instance,  the  RST  forecasts  had  a  lower  CRPS  value  than  the 
autoregressive  forecasts  in  each  month  during  the  evaluation  period,  May  through  November  2003. 
Under the null hypothesis of equal predictive performance this happens with probability $(1/2)^7 = 1/128$
only.  Similarly,  the  autoregressive  forecasts  outperformed  the  persistence  forecasts  in  May  through 
October  2003,  but  not  in  November  2003.  Clearly,  various  other  tests  can  be  employed,  but  one 
needs  to  be  careful  to  avoid  dependencies  in  the  forecast  differentials.  In  our  situation,  the  results 
for  distinct  months  can  be  considered  independent  for  all  practical  purposes.  Diebold  and  Mariano 
(1995)  gave  a  thoughtful  discussion  of  these  issues,  and  we  refer  to  their  work  for  a  comprehensive 
account  of  tests  of  predictive  performance.  Figure  12  illustrates  the  Brier  score  decomposition  (14) 
of  the  CRPS  value  for  the  entire  evaluation  period.  The  RST  forecasts  outperform  the  persistence 
forecasts  and  the  autoregressive  forecasts  at  all  thresholds. 
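
The probability quoted above for the month-by-month comparison is the tail probability of a sign test. A minimal sketch, assuming that the monthly comparisons are independent and that each method is equally likely to win a month under the null hypothesis; the function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def sign_test_tail(wins, months):
    """P(at least `wins` wins in `months` independent fair comparisons)."""
    favourable = sum(comb(months, k) for k in range(wins, months + 1))
    return Fraction(favourable, 2 ** months)


print(sign_test_tail(7, 7))  # 1/128
```

The same function gives the one-sided tail probability for any other number of monthly wins.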

We  noted  in  Section  3.4  that  the  continuous  ranked  probability  score  generalizes  the  absolute 
error,  and  reduces  to  the  latter  for  point  forecasts.  Table  9  shows  the  mean  absolute  error  (MAE)  for 
the  point  forecasts  associated  with  the  persistence,  autoregressive  and  RST  techniques,  respectively. 
The  persistence  point  forecast  is  simply  the  most  recent  observed  value  of  the  hourly  average  wind 
speed  at  the  Stateline  wind  energy  center.  The  autoregressive  point  forecast  is  the  mean  of  the 
associated  predictive  distribution,  and  similarly  for  the  RST  forecast.  The  results  for  the  predictive 
median  are  very  similar.  The  RST  point  forecasts  outperform  the  autoregressive  point  forecasts, 
and  the  autoregressive  point  forecasts  outperform  the  persistence  point  forecasts.  The  MAE  values 
in  Table  9  and  the  CRPS  values  in  Table  8  are  reported  in  the  same  unit  as  the  wind  speed 
measurements, that is, in $\mathrm{m \cdot s^{-1}}$, and can be directly compared. The insights that the monthly
scores  provide  are  indicative  of  the  potential  benefits  of  thoughtful  stratification. 

The CRPS and MAE values establish a clear-cut ranking of the competing forecast methodologies
that places the RST technique first, followed by the autoregressive and the persistence
forecasts. The RST method also performed best in terms of probabilistic calibration and marginal
calibration, and the RST forecasts were much sharper than the autoregressive and the persistence
forecasts. The diagnostic approach furthermore points at forecast deficiencies and suggests
potential improvements to the predictive models. In particular, the marginal calibration plots in
Figure 9 suggest a modified version of the RST technique that uses truncated normal rather than
cut-off normal predictive distributions. This modification yields small but consistent improvements
in the predictive performance of the RST method, and we intend to report details in subsequent work.

Table 7: Average width of central prediction intervals, in $\mathrm{m \cdot s^{-1}}$. The nominal
coverages are 50% and 90%, respectively.

Interval                   50%     90%
Persistence forecast       2.63    7.51
Autoregressive forecast    2.74    6.55
RST forecast               2.20    5.31

Figure 11: Sharpness diagram for persistence forecasts (PS), autoregressive forecasts (AR) and
RST forecasts of hourly average wind speed at the Stateline wind energy center. The boxplots,
labeled PS 50, AR 50, RST 50, PS 90, AR 90 and RST 90, show the 5th, 25th, 50th, 75th and 95th
percentile of the width of the central prediction interval, in $\mathrm{m \cdot s^{-1}}$. The nominal
coverage is 50% and 90%, respectively.

Table 8: Average continuous ranked probability score (CRPS) for probabilistic forecasts of hourly
average wind speed at the Stateline wind energy center in May through November 2003, month
by month and for the entire evaluation period, in $\mathrm{m \cdot s^{-1}}$.

CRPS                       May     Jun     Jul     Aug     Sep     Oct     Nov     May-Nov
Persistence forecast       1.16    1.08    1.29    1.21    1.20    1.29    1.16    1.20
Autoregressive forecast    1.12    1.02    1.10    1.11    1.11    1.22    1.13    1.12
RST forecast               0.96    0.85    0.95    0.95    0.97    1.08    1.00    0.97

Table 9: Mean absolute error (MAE) for point forecasts of hourly average wind speed at the
Stateline wind energy center in May through November 2003, month by month and for the entire
evaluation period, in $\mathrm{m \cdot s^{-1}}$.

MAE                        May     Jun     Jul     Aug     Sep     Oct     Nov     May-Nov
Persistence forecast       1.60    1.45    1.74    1.68    1.59    1.68    1.51    1.61
Autoregressive forecast    1.53    1.38    1.50    1.54    1.53    1.68    1.54    1.53
RST forecast               1.32    1.18    1.33    1.31    1.36    1.48    1.37    1.34

5  Discussion 

Our paper addressed the important issue of evaluating predictive performance for probabilistic
forecasts of continuous variables. Following the lead of Dawid (1984) and Diebold, Gunther and Tay
(1998), predictive distributions have traditionally been evaluated within the framework of checks
for perfect forecasts, consisting of an assessment of the uniformity and independence of the
probability integral transform. We introduced the more pragmatic and flexible paradigm of maximizing
sharpness subject to calibration. Calibration refers to the statistical consistency between the
predictive distributions and the associated observations and is a joint property of the predictions and
the values that materialize. Sharpness refers to the concentration of the predictive distributions
and is a property of the forecasts only.
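
The uniformity check on the probability integral transform can be sketched in a few lines of Python. The simulation below is a hypothetical illustration, not the paper's case study: observations are drawn from N(mu_t, 1), and an ideal forecaster that issues N(mu_t, 1) is compared with an overdispersed forecaster that issues N(mu_t, 2^2). The helper names `pit_values` and `pit_histogram` are assumptions.

```python
import random
from statistics import NormalDist


def pit_values(forecasts, observations):
    """PIT value F_t(x_t) for each forecast/observation pair."""
    return [f.cdf(x) for f, x in zip(forecasts, observations)]


def pit_histogram(pits, bins=10):
    """Relative frequencies of the PIT values in equally wide bins on [0, 1]."""
    counts = [0] * bins
    for p in pits:
        counts[min(int(p * bins), bins - 1)] += 1  # p == 1.0 goes to the last bin
    return [c / len(pits) for c in counts]


rng = random.Random(0)
mus = [rng.gauss(0.0, 1.0) for _ in range(5000)]
obs = [rng.gauss(m, 1.0) for m in mus]  # nature draws x_t from N(mu_t, 1)

ideal = pit_histogram(pit_values([NormalDist(m, 1.0) for m in mus], obs))
wide = pit_histogram(pit_values([NormalDist(m, 2.0) for m in mus], obs))

print("ideal forecaster:        ", [round(h, 3) for h in ideal])
print("overdispersed forecaster:", [round(h, 3) for h in wide])
```

A roughly flat histogram is consistent with probabilistic calibration, whereas a hump-shaped histogram indicates predictive distributions that are too wide.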

We interpreted probabilistic forecasting within a game-theoretic framework that allowed us
to distinguish probabilistic calibration, exceedance calibration and marginal calibration, and we
developed diagnostic tools for evaluating and comparing probabilistic forecasters. Probabilistic
calibration corresponds to the uniformity of the probability integral transform (PIT), and the
PIT histogram remains a key tool in the diagnostic approach to forecast evaluation. In addition,
we proposed the use of marginal calibration plots, sharpness diagrams and proper scoring rules,
which form powerful tools for learning about forecast deficiencies and ranking competing forecast
methodologies. Our own applied work on probabilistic forecasting has benefitted immensely from
these tools, as documented in Section 4 and in the partial applications in Gneiting et al. (2004),
Raftery, Gneiting, Balabdaoui and Polakowski (2005) and Gneiting et al. (2005). Furthermore,
predictive distributions can be reduced to point forecasts, or to probability forecasts of binary
events, and the associated forecasts can be assessed using the diagnostic devices described by
Murphy, Brown and Chen (1989) and Murphy and Winkler (1992), among others.

Figure 12: Brier score plot for persistence forecasts (dashed line), autoregressive forecasts
(dot-dashed line) and RST forecasts (solid line) of hourly average wind speed at the Stateline wind
energy center, in m·s⁻¹. The graphs show the Brier score as a function of the threshold value. The
area under the associated curve equals the CRPS value (14).
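
The caption of Figure 12 notes that the area under the Brier score curve equals the CRPS. The sketch below checks this numerically for a single forecast case. It assumes a Gaussian predictive distribution so that the integral can be compared with the known closed form; the function names, the integration limits and the example values are hypothetical.

```python
from math import pi, sqrt
from statistics import NormalDist


def brier_score(dist, x, threshold):
    """Brier score of the probability forecast for the binary event {X <= threshold}."""
    outcome = 1.0 if x <= threshold else 0.0
    return (dist.cdf(threshold) - outcome) ** 2


def crps_by_integration(dist, x, lo, hi, steps=20000):
    """Midpoint-rule approximation of the area under the Brier score curve on [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(brier_score(dist, x, lo + (i + 0.5) * h) for i in range(steps))


def crps_normal(mu, sigma, x):
    """Closed-form CRPS for N(mu, sigma^2), used here only as a reference value."""
    std = NormalDist()
    z = (x - mu) / sigma
    return sigma * (z * (2.0 * std.cdf(z) - 1.0) + 2.0 * std.pdf(z) - 1.0 / sqrt(pi))


forecast = NormalDist(6.0, 1.5)  # hypothetical predictive distribution, in m/s
observation = 7.1                # hypothetical observed wind speed, in m/s

area = crps_by_integration(forecast, observation, lo=-9.0, hi=21.0)
reference = crps_normal(6.0, 1.5, observation)
print(f"area under Brier score curve: {area:.4f}")
print(f"closed-form CRPS:             {reference:.4f}")
```

The integration limits cover ten predictive standard deviations on either side of the mean, so the truncated tails contribute a negligible amount to the area.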

If  we  were  to  reduce  our  conclusions  to  a  single  recommendation,  we  would  close  with  a  call  for 
the  assessment  of  sharpness,  particularly  when  the  goal  is  that  of  ranking.  Previous  comparative 
studies  of  the  predictive  performance  of  probabilistic  forecasts  have  largely  focused  on  calibration. 
For  instance,  Moyeed  and  Papritz  (2002)  compared  spatial  prediction  techniques,  Clements  and 
Smith  (2000)  and  Boero  and  Marrocu  (2004)  evaluated  linear  and  non-linear  time  series  models, 
Garratt  et  al.  (2003)  assessed  macroeconomic  forecast  models,  and  Bauwens  et  al.  (2004)  studied 
the  predictive  performance  of  financial  duration  models.  In  each  of  these  works,  the  assessment 
took  place  in  terms  of  the  predictive  performance  of  the  associated  point  forecasts,  and  in  terms  of 
the  uniformity  of  the  probability  integral  transform.  We  contend  that  comparative  studies  of  these 
types  call  for  routine  assessments  of  sharpness,  in  the  form  of  sharpness  diagrams  and  through  the 
use  of  proper  scoring  rules. 

Despite the frequentist flavor of our diagnostic approach, calibration and sharpness are properties
that are relevant to Bayesian forecasters as well. Rubin (1984, pp. 1161 and 1160) noted that
“the probabilities attached to Bayesian statements do have frequency interpretations that tie the
statements to verifiable real world events.” Consequently, a “Bayesian is calibrated if his
probability statements have their asserted coverage in repeated experience.” Gelman, Meng and Stern
(1996) developed Rubin’s posterior predictive approach, proposed posterior predictive checks as
Bayesian counterparts to the classical tests for goodness of fit, and advocated their use in judging
the fit of Bayesian models. This relates to our diagnostic approach, which emphasizes the need for
understanding the ways in which predictive distributions fail or succeed. Indeed, the diagnostic
devices posited herein form powerful tools for Bayesian as well as frequentist model diagnostics and
model choice. Tools such as the PIT histogram, marginal calibration plots, sharpness diagrams and
proper scoring rules are widely applicable, since they are nonparametric, do not depend on nested
models, allow for structural change, and apply to predictive distributions that are represented by
samples, as they arise in a rapidly growing number of Markov chain Monte Carlo methodologies and
ensemble prediction systems. In the time series context, the predictive framework is natural and
the model fit can be assessed through the performance of the time-forward predictive distributions
(Smith 1985; Shephard 1994; Frühwirth-Schnatter 1996; Bauwens et al. 2004). In other types of
situations, a cross-validatory approach can often be used fruitfully (Dawid 1984a, p. 288; Gneiting
and Raftery 2004).
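
When a predictive distribution is available only as a sample, for instance as Markov chain Monte Carlo output or as an ensemble, the CRPS can be evaluated from the sample through the standard kernel representation CRPS(F, x) = E|X - x| - (1/2) E|X - X'|, where X and X' are independent draws from F. The Python sketch below applies this representation to the empirical distribution of a sample. The function name `crps_from_sample` and the ensemble values are hypothetical.

```python
def crps_from_sample(sample, x):
    """CRPS of the empirical distribution of `sample` at the observation x."""
    m = len(sample)
    # E|X - x|: mean absolute difference between sample members and the observation.
    term1 = sum(abs(s - x) for s in sample) / m
    # (1/2) E|X - X'|: half the mean absolute difference over all ordered pairs.
    term2 = sum(abs(a - b) for a in sample for b in sample) / (2.0 * m * m)
    return term1 - term2


# Hypothetical eight-member ensemble forecast of wind speed (m/s) and its observation.
ensemble = [5.1, 6.3, 5.8, 7.0, 6.1, 5.5, 6.6, 6.0]
observation = 6.4

print(f"sample-based CRPS: {crps_from_sample(ensemble, observation):.3f}")
```

For a sample with a single member the second term vanishes and the score reduces to the absolute error, which is what makes the CRPS comparable with the MAE of a point forecast. The double sum costs O(m^2) operations; for large samples a sorted-sample formula would be preferable.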


Appendix 


Proof  of  Theorem  1 


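Consider the random variable $U = F(x_1)^{z_1} F(x_2)^{z_2} \cdots F(x_T)^{z_T}$, where
$x_1 \sim G_1, \ldots, x_T \sim G_T$ and $(z_1, \ldots, z_T)'$ is multinomial with equal probabilities.
The finite probabilistic calibration condition implies that $U$ is uniformly distributed. By the
variance decomposition formula,

\[
\mathrm{var}(F) = \mathrm{var}\left(F^{-1}(U)\right)
= \mathrm{E}\left[\mathrm{var}\left(F^{-1}(U) \mid z_1, \ldots, z_T\right)\right]
+ \mathrm{var}\left[\mathrm{E}\left(F^{-1}(U) \mid z_1, \ldots, z_T\right)\right].
\]

The first term in the decomposition equals

\[
\frac{1}{T} \sum_{t=1}^{T} \mathrm{var}(x_t) = \frac{1}{T} \sum_{t=1}^{T} \mathrm{var}(G_t),
\]

and the second term is nonnegative and vanishes if and only if
$\mathrm{E}(G_1) = \cdots = \mathrm{E}(G_T)$. ■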
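
The variance decomposition used in this proof can be illustrated numerically. In the sketch below the distributions G_t are taken to be Gaussian with hypothetical means and standard deviations, and F is taken to be their equal-weight mixture, which is the distribution of F^{-1}(U) in the construction above. Both function names are assumptions made for the illustration.

```python
from statistics import mean, pvariance


def mixture_variance(mus, sigmas):
    """Variance of the equal-weight mixture of N(mu_t, sigma_t^2), from its moments."""
    first_moment = mean(mus)
    second_moment = mean(s * s + m * m for m, s in zip(mus, sigmas))
    return second_moment - first_moment ** 2


def variance_decomposition(mus, sigmas):
    """Return (average of var(G_t), variance of the means E(G_t))."""
    within = mean(s * s for s in sigmas)
    between = pvariance(mus)
    return within, between


# Hypothetical means and standard deviations of G_1, ..., G_T with T = 4.
mus = [4.0, 6.5, 9.0, 5.5]
sigmas = [1.0, 1.5, 2.0, 1.2]

within, between = variance_decomposition(mus, sigmas)
print(f"var(F)               = {mixture_variance(mus, sigmas):.4f}")
print(f"mean var(G_t)        = {within:.4f}")
print(f"variance of E(G_t)   = {between:.4f}")
```

The second term is zero exactly when all means coincide, in which case var(F) equals the average variance of the G_t; otherwise var(F) is strictly larger.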


Proof  of  Theorem  2 

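For $p \in (0, 1)$ and $t = 1, 2, \ldots$, put $Y_t = 1\{p_t < p\} - G_t \circ F_t^{-1}(p)$ and note
that $\mathrm{E}(Y_t) = 0$. By Theorem 2 of Blum et al. (1963),

\[
\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} Y_t
= \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \left( 1\{p_t < p\} - G_t \circ F_t^{-1}(p) \right) = 0
\]

almost surely. The uniqueness of the limit implies that (6) is equivalent to the probabilistic
calibration condition (3). ■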
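
The limit statement in this proof can be examined by simulation. The Python sketch below takes p_t to be the PIT value F_t(x_t) and uses a hypothetical, deliberately miscalibrated setting: nature draws x_t from G_t = N(mu_t, 1) while the forecaster issues F_t = N(mu_t, 2^2). The function name `calibration_gap` and all parameters are assumptions.

```python
import random
from statistics import NormalDist


def calibration_gap(p, mus, obs, forecast_sd, truth_sd):
    """Average of 1{p_t < p} minus average of G_t(F_t^{-1}(p)) over all cases."""
    T = len(obs)
    # Empirical frequency of PIT values below p, with p_t = F_t(x_t).
    pit_freq = sum(
        NormalDist(m, forecast_sd).cdf(x) < p for m, x in zip(mus, obs)
    ) / T
    # Average of the composition G_t(F_t^{-1}(p)) over the same cases.
    avg_comp = sum(
        NormalDist(m, truth_sd).cdf(NormalDist(m, forecast_sd).inv_cdf(p))
        for m in mus
    ) / T
    return pit_freq - avg_comp


rng = random.Random(1)
T = 20000
mus = [rng.gauss(0.0, 1.0) for _ in range(T)]
obs = [rng.gauss(m, 1.0) for m in mus]  # x_t drawn from G_t = N(mu_t, 1)

for p in (0.1, 0.25, 0.5, 0.9):
    gap = calibration_gap(p, mus, obs, forecast_sd=2.0, truth_sd=1.0)
    print(f"p = {p:4.2f}   gap = {gap:+.4f}")
```

The gap is close to zero for every p even though the forecaster is not calibrated: the two averages converge to the same limit, and probabilistic calibration holds only when that common limit equals p.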


Proof  of  Theorem  3 

For $x \in \mathbb{R}$ let $q = F(x)$, and for $t = 1, 2, \ldots$ put $q_t = F(x_t)$. Then

\[
\hat{G}_T(x) = \frac{1}{T} \sum_{t=1}^{T} 1\{x_t < x\} = \frac{1}{T} \sum_{t=1}^{T} 1\{q_t < q\}.
\]

By Theorem 2 with $F_t = F$ for $t = 1, 2, \ldots$ we have that
$\frac{1}{T} \sum_{t=1}^{T} 1\{q_t < q\} \to q$ almost surely if and only if
$\frac{1}{T} \sum_{t=1}^{T} G_t \circ F^{-1}(q) \to q$ almost surely; hence, marginal calibration is
equivalent to (9). ■